Machine learning-based meta-analysis of colorectal cancer and inflammatory bowel disease

Colorectal cancer (CRC) is a major global health concern, resulting in numerous cancer-related deaths. CRC detection, treatment, and prevention can be improved by identifying genes and biomarkers. Despite extensive research, the underlying mechanisms of CRC remain elusive, and previously identified biomarkers have not yielded satisfactory insights. This shortfall may be attributed to the predominance of univariate analysis methods, which overlook potential combinations of variants and genes contributing to disease development. Here, we address this knowledge gap by presenting a novel multivariate machine-learning strategy to pinpoint genes associated with CRC. Additionally, we applied our analysis pipeline to Inflammatory Bowel Disease (IBD), as IBD patients face substantial CRC risk. The importance of the identified genes was substantiated by rigorous validation across numerous independent datasets. Several of the discovered genes have been previously linked to CRC, while others represent novel findings warranting further investigation. A Python implementation of our pipeline can be accessed publicly at https://github.com/AriaSar/CRCIBD-ML.


Introduction
Colorectal cancer (CRC) ranks as one of the top three deadliest cancers worldwide, with an estimated 1.8 million cases and 881,000 fatalities in 2018 alone [1].Timely detection of CRC can significantly improve prognosis and reduce mortality rates [2].When CRC is diagnosed in individuals below the age of 50, it is referred to as early-onset CRC (eoCRC).Over the past few decades, the epidemiology of eoCRC has been subject to change, as reported by numerous studies.Starting from the 1990s, there has been a rise in the incidence of eoCRC across the world, including both high-and low-income countries [3,4].The rate of increase in eoCRC incidence is accelerating and is predicted to pose a significant public health challenge [3,4].Recently, the US Preventive Services Task Force recommended lowering the average-risk population screening age to 45 years [5,6].Possible justifications for the increasing incidence of eoCRC include a westernized diet, including red and processed meats; consumption of monosodium glutamate, titanium dioxide, high-fructose corn syrup and synthetic dyes; obesity; stress; and widespread use of antibiotics [7].
Due to its heterogeneity, CRC is controlled by many genes and environmental factors.Epigenetics refers to alterations in gene expression or function without changes in DNA sequence.Primary epigenetic modifications include DNA methylation, post-transcriptional modifications of histone and non-coding RNA-mediated changes of gene expression [8].Despite its significant recognition, the contribution of epigenetic events to cancer evolution needs further investigation [9,10].It is believed that the modifications in epigenetics and the changes in the expression of non-coding RNAs can be utilized as biomarkers for the diagnosis, prediction of treatment response and prognostication in the case of CRC [11].The genetic and epigenetic modification of cancer-associated genes occurs independently but recurrently in CRCs, and that epigenome alterations probably control important tumour cell phenotypes, including escape from immune surveillance [12].
Recent studies have provided important insights into the molecular mechanisms that underlie the formation of CRC.The majority of CRC cases (75%) are sporadic, while the remaining cases are either linked to inflammatory bowel diseases (IBD) or have a familial origin [13].It is estimated that the process of CRC tumorigenesis is slow, taking almost two decades for a tumor to form [14].Despite extensive research efforts and the elucidation of some pathways and genes, a considerable unknown portion of these diseases persists.In particular, the dynamics and complex process of cancer cell invasion and metastasis is poorly understood [15][16][17].
Oncogenic transformation in CRC is known to be caused by the driver genes APC, KRAS, SMAD4, and TP53, which modulate global translational capacity in intestinal epithelial cells [18].Given our present understanding of the intricate nature of cancer genomes, how cancer cells evolve over time under treatment, and how inhibiting targets affects the body, it is now advisable to move away from the one gene, one drug approach and embrace a 'multi-gene, multi-drug' model for making informed decisions regarding therapy [19].In other words, the unidentified aspect of the disease may stem from the cumulative effects of multiple low-penetrance genes, which together pose a substantial risk [20].To that end, there have been considerable interest in molecular subtype classification of CRC using gene expression data, hierarchical clustering, and machine learning [19,[21][22][23].
Machine learning (ML) techniques have demonstrated their efficacy in addressing biological queries [24,25].Owing to their notable accomplishments, the application of ML methods to biological data is expanding, revealing their considerable potential in tackling genetic problems such as the imputation of missing SNPs and DNA methylation states, disease diagnosis [24,26], antibody development [27], and numerous other areas.The use of ML has demonstrated great potential in enhancing our comprehension of cancer dynamics, and it holds the possibility of substantially transforming our understanding of cancer dynamics by revealing fresh insights into the molecular mechanisms that drive cancer progression and impact treatment response [28,29].
In this paper, our primary goal is to investigate the genetic landscape that underlies the progression of both CRC and IBD, with a specific emphasis on additive gene interactions.We employed novel ML algorithms trained on case-control datasets from the GEO (Gene Expression Omnibus) database, consisting of 566 CRC cases and 262 controls.Through this process, we identified a subset of prominent genes capable of cumulatively distinguishing between CRC and control samples.To demonstrate the efficacy of our selected genes, we conducted validation using the top 40 genes on multiple external and independent datasets.Remarkably, our model accurately classified 1807 out of 1860 cases and 115 out of 119 controls, highlighting the strength of our approach.Although some of the chosen top genes were already known and studied in the literature, it is noteworthy that building a model solely based on just those well-known genes did not yield satisfactory validation results, leading to numerous misclassification.
Additionally, recognizing the heightened risk of CRC development in IBD patients, we also set out to identify a subset of genes capable of distinguishing between IBD cases and healthy controls.To accomplish this, we trained ML algorithms on GEO datasets comprising 288 IBD cases and 76 controls.Using the top 100 selected genes, our validation on external IBD datasets led to the correct classification of 212 out of 231 IBD cases and 51 out of 54 healthy controls.We note that the misclassified samples included 9 inflamed IBD samples that were misclassified as healthy.
To bridge CRC and IBD, we constructed a gene network using the STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) platform, revealing direct interactions between IBD and CRC genes, highlighting GAPDH's pivotal role.Our study recommends a closer examination of oncogenes TNS4, SLC7A5, and SCD within the context of the nuclear receptors meta-pathway.Furthermore, genes SLC7A5, SCD, GAPDH, and SDF2L1 are implicated in the mTOR signaling pathway, underscoring the need for more investigation.These findings hold the promise of deepening our comprehension of the genetic mechanisms underlying CRC and IBD.

Data retrieval and curation
We combined 6 gene expression datasets from the GEO (Gene Expression Omnibus) database to form a training dataset, and an additional 14 different gene expression datasets were selected for validation.Some of the validation datasets contain only cases.A comprehensive summary of each CRC dataset is presented in the S1 Table for further reference.We carefully examined all the validation datasets to make sure there is no overlaps or leakage with training datasets.For instance, dataset GSE32323 was omitted due to a high probability of containing identical patients (with differing expressions) as those in dataset GSE21510.Additionally, to enhance reliability, gene expression samples were grouped based on geographical similarity (country/ city) within either the training or validation sets as much as possible.It is important to note that datasets GSE68468, GSE103512 and GSE2109 encompass samples derived from various organs in addition to the colon and rectum; however, in our analysis exclusively, we only included samples derived from colon and rectum.To be able to merge the training datasets together and then perform the validation, we only kept genes that are common between all training and validation datasets.In the end, all the CRC datasets had uniformly 10,113 genes.
We selected 5 gene expression IBD datasets from GEO for training and 6 different gene expression IBD datasets for validation.We included only those samples who had not undergone any specific treatment.Additionally, we excluded datasets consisting of blood samples.Dataset GSE16879 includes pre-and post-infliximab treatment samples, of which we selected only the pre-treatment ones.In GSE59071 and GSE48958, inactive samples were excluded.From GSE179285, we included only inflamed samples and from GSE4183, only IBD and normal samples were chosen.From GSE37283, patients diagnosed with ulcerative colitis with neoplasia were retained.The S2 Table contains details of the datasets used for IBD.After discarding genes that are not common between all IBD datasets, all IBD datasets had uniformly 16,413 genes.

ML models, evaluation, and performance metrics
One of the most pivotal elements of any ML model applied to genomic data is feature selection which is tasked with uncovering the most disease-relevant genes among thousands.The detection of low-penetrance genes related to CRC and IBD necessitates the use of wrapper or hybrid feature selection techniques.However, due to the computational complexity, wrapper methods are not feasible for high-dimensional datasets like those employed in this paper.In this research, we utilized SVFS (Singular Vectors Feature Selection), a hybrid feature selection method that has recently demonstrated superior results compared to other methods on gene expression data [30].SVFS is a method designed for high-dimensional datasets.Given a matrix A with its Moore-Penrose pseudo-inverse A † , it is shown in [30,31] that the projector P A = I − A † A partitions features into clusters based on their correlations.Initially, SVFS identifies and retains only those features that correlate with the class label, discarding others as irrelevant.In the subsequent step, it further clusters the remaining features and selects the most significant ones from each cluster.
As we can see from the S1 and S2 Tables, the number of cases is much more than the number of controls.To avoid developing biased models, we employed SMOTE [32] (synthetic minority oversampling technique) to use the controls in the training datasets and generate synthetic controls; this way, we equalize the number of cases and controls within the training dataset.We run the SVFS on the training dataset to select the first 100 most important genes, as shown in the S3 and S4 Tables.
To build an ML model, we employed an ensemble classifier consisting of Random Forest (RF), Logistic Regression (LR), and Support Vector Machine (SVM).Each RF, LR, and SVM are based on different algorithms, and their ensemble provides greater robustness than single models and decreases the potential for overfitting.The ensemble classifier was trained on the reduced training dataset (using only the most important genes) to build a model.Finally, the model was evaluated on all validation datasets.We employed accuracy, the precision-recall curve, and the confusion matrix to report validation results.Fig 2 illustrates the validation results for each CRC case-control validation dataset.We note that using only the first 40 genes for model generation, the model comes close to achieving optimal performance on all validation sets.Confusion matrices and precision-recall curves are generated based on these 40 genes.
Several of our datasets consisted of cases only.For case-only datasets, we adopted recall (also known as sensitivity or true positive rate), representing the proportion of correctly predicted case samples relative to the overall number of cases.In order to achieve reliable results from ML algorithms, it should be noted that the validation datasets must not be utilized at any point during the training or model generation process.Given the validation results in Figs 2 and 3, we deduce that using 40 prominent identified genes, our model could diagnose 1807 cases out of 1860 and 115 controls out of 119.Some of the previously reported significant genes include TP53, APC, KRAS, MGMT [33], SMAD2 [34] and SMAD4 [34].It is interesting to note that if we build a model just based on these wellknown genes, we do not get acceptable validation results.Indeed, we implemented a supplementary pipeline using only TP53, APC, KRAS, MGMT, SMAD2, and SMAD4, and it turned out that 100 controls out of 103 are misclassified (for this experiment, we had to exclude GSE38026 because the KRAS gene does not exist in this dataset).The detailed results are presented in the S1 Fig.We observe that all components of the supplementary pipeline, including oversampling, remain unchanged from the original pipeline.The sole distinction lies in the selection of genes.In essence, the supplementary pipeline was derived from our original one by substituting the top 40 genes with 6 well-known genes from the literature.
The same methodology was employed for IBD; that is, SVFS was utilized on the training IBD dataset, and the first 100 significant genes were selected, as shown in the S4  healthy samples in GSE9452, GSE37283, GSE4183 and GSE48958.In the case of GSE36807, the model accurately diagnosed all healthy samples, though nine inflamed samples were misclassified as healthy.Overall, the classifier exhibited strong performance, suggesting an acceptable identification of IBD-related genes by identifying 212 IBD cases out of 231 and 51 healthy controls out of 54.
In our analysis, as depicted in Figs 2 and 4, the CRC model displayed approximate optimal performance upon considering the top 40 genes, while the IBD model showed some fluctuations.This variation can be attributed to the smaller validation sample size in the IBD model, which means minor misclassifications can significantly alter its performance curve.Furthermore, we suspect that the classification of IBD may inherently be more complicated, which could account for the observed variations in performance.

Analyzing gene interactions
In order to discover and comprehend the inherent interactions among the identified genes, we utilized STRING (Search Tool for the Retrieval of Interacting Genes/Proteins).We constructed a gene network by integrating the top 50 IBD genes with 50 CRC genes in the initial step, setting the interaction score to medium confidence.intermediary gene.This gene and its adjacent genes might be fundamental to IBD and CRC progression.The intermediary genes are highly likely to contribute to the disease due to their strong connections to genes in our subset and, importantly, their bridging functions.We further explored the COSMIC (Catalogue Of Somatic Mutations In Cancer) database to determine if any of the genes in our subset had been previously reported as having a strong association with CRC.Tissue selection, Sub-tissue selection, Histology selection, and Sub-histology selection were set to Large intestine, Include all, Carcinoma, and Adenocarcinoma, respectively.Remarkably, TP53, SMAD4, RNF43, CTNNB1, and PTEN ranked among the top 20 most frequently mutated CRC-related genes listed in COSMIC.On the other hand, several of our reported significant genes were not on the COSMIC list and did not receive adequate attention from researchers.
We also performed GSEA (Gene Set Enrichment Analysis) using most first 50 identified CRC genes.As shown in Table 1, GSEA showed that TNS4, SLC7A5, and SCD are involved in the nuclear receptors meta-pathway.Genes SLC7A5, SCD, GAPDH and SDF2L1 were involved in the mTOR signalling pathway.Also, several other gene sets were associated with cell cycle regulation and transcription regulation.Given the limited understanding of the underlying mechanisms of CRC and IBD, we propose to consider other important genes that are not part of the network in Fig 5 .For instance, an in-depth investigation of TNS4, GAPDH, L1CAM, GAL, CRYAB, IRF7, GPN1, TMEM39B, EZR, and all other genes referred to in the S3 and S4 Tables are needed to discern their role in CRC and IBD.

Discussion
One of the customary approaches to find potential biomarkers for a disease is to identify genes that are differentially expressed (DEG) in cases and controls [35][36][37][38].Such methods usually set a threshold to identify DEGs, and as such many genes are filtered out even if they were close to the threshold.Even though one can validate whether a candidate gene is DEG on an external dataset, usually, no single gene can be the sole cause of cancer, and there is no methodology to quantify the relation of a subset of DEGs to cancer.In contrast, machine learning and feature selection algorithms, when validated on external datasets, can identify promising biomarkers.
In this study, we identified several potential genes correlated with CRC using multiple multivariate machine-learning methods.Some of these genes have not received enough attention in previous studies.Our findings provide new insights into the potential additive genes correlated with CRC and IBD, as patients with IBD are at high risk of developing CRC.Identifying novel genes correlated with CRC and IBD provides a foundation for future research into the underlying mechanisms of these diseases.
Notably, TNS4 (CTEN) emerges as a critical gene associated with CRC.Despite previous studies highlighting TNS4's association with CRC, its significant contribution has been relatively overlooked.This gene has been identified as an oncogenes gene in several studies.It has been argued that TNS4 plays a pivotal role in CRC tumorigenesis, suggesting that TNS4 suppression could represent a promising therapeutic approach [39] and its knockdown improves sensitivity to Gefitinib [40].It has been revealed that TNS4 interacts with MET signalling [41], a process known to enhance motility, cancer cell survival, and angiogenesis [42].This interaction between TNS4 and MET is direct, and TNS4 plays a role in maintaining MET stability, thereby supporting cancer cell survival [41].
We also identify GAPDH is another potential CRC-related protein-coding gene.Not only does it connect a significant number of CRC and IBD-related genes (Fig 5(b)), but it also ranks among the top 40 most prominent CRC-related genes given in the S3 Table .Investigations have examined GAPDH's interaction with mutated KRAS and BRAF, suggesting that GAPDH suppression via vitamin C may disrupt tumor growth [43].Other work has analyzed tumor versus non-tumor pairs in 195 cases, identifying substantial overexpression of GAPDH in CRC cases [44].Researchers have also observed significant upregulation of GAPDH in CRC, indicating its potential value in early CRC detection [45].
SLC7A5 is identified as the second most significant gene on our list (S3 Table ).Najumudeen et al. conducted comprehensive research on SLC7A5's correlation with CRC [46].They proposed that SLC7A5 might offer potential therapy for KRAS-mutant CRC unresponsive to other treatments [46].Additionally, Huang et al. identified SLC7A5 as one of the five key genes involved in the ferroptosis of colon cancer cells [47].
HIST3H2A, HGD, GPN1, COL21A1, and SDF2L1 appear as potential novel biomarkers using out results.While there is evidence linking HIST3H2A to lung cancer [48] and pancreatic cancer [49], its association with CRC remains unverified.Yi et al. found a significant association between HGD and rectal cancer [50], but no other research has established a strong link between HGD and CRC.To our knowledge, GPN1 is a novel gene identified through our work.Presently, limited information is available on this crucial gene, warranting further investigation for a more comprehensive understanding.As per our review, Li et al.'s study stands as the only research that has identified COL21A1 as a possible diagnostic marker [51].Despite examining the link between the SDF2L1 gene and different cancer types, including Nasopharyngeal Carcinoma [48], its potential role as a CRC marker has not been acknowledged.
Cruz-Gil et al. identified SCD as a critical component of lipid metabolism in CRC [52].The relationship between SCD and ACSL increases the risk of relapse in CRC patients [52].Furthermore, Liao et al.'s research suggests a connection between overexpressed SCD-1 and advanced CRC [53].
CRYAB has been linked to various cancer types [54], and multiple studies have explored its association with CRC.Deng et al. characterized CRYAB as a tumor-suppressor gene and a potential diagnostic marker [55], and Shi et al. verified its function as a prognostic CRC biomarker [54].Dai et al. also proposed CRYAB as a promising target for CRC therapies [56].
Another dimension of importance is the role of non-coding RNAs in CRC.It is believed that modifications in epigenetics and alterations in the expression of non-coding RNAs can be harnessed as biomarkers for CRC diagnosis, prognostication, and treatment response prediction [11].Non-coding RNAs, especially microRNAs and long non-coding RNAs, play significant roles in gene expression regulation and are implicated in several CRC pathways [57][58][59].For instance, RAMS11, a non-coding RNA, is highlighted for its regulation of topoisomerase IIα (TOP2α), underscoring its potential value as a biomarker and therapeutic target for metastatic CRC [60].The increased expression of lncRNA IGFL2-AS1 in CRC tumor tissues and cells [61] further supports this perspective.By concentrating predominantly on coding regions of DNA, we might have overlooked these crucial actors in CRC's genetic landscape.Incorporating both coding and non-coding DNA segments could provide a more comprehensive insight into CRC's genetic intricacies, thereby shaping its diagnosis and treatment avenues more effectively.
Validation results on CRC and IBD datasets provided us with a set of genes that are deemed important in the development of CRC and the transition of IBD into CRC.Our findings offer new insights into the potential additive genes correlated with CRC and IBD and emphasize the value of machine learning algorithms in identifying genes that may contribute to CRC and IBD development.Nonetheless, our work here has certain limitations, such as the utilization of individual intermediary genes to establish connections between significant genes associated with IBD and CRC.We also note that many genes were removed at the initial stage of our preprocessing procedure to obtain an identical subset of genes across all datasets.
Further studies focusing on genes' additive functions instead of single-variate analyses are necessary to confirm these genes' contributions.The potential significance of these findings for clinical applications includes the possibility of developing better prevention, detection, and treatment methods for CRC and IBD patients.

Data preprocessing
The presence of missing values in data can disrupt numerous machine learning algorithms' functionality, potentially leading to biased outcomes, inaccurate predictions, and diminished accuracy.Addressing incomplete data is thus vital before deploying machine learning algorithms, including those employed in the current study, through either the removal of incomplete observations or the imputation of missing values.The datasets used in this research contain several missing values.Discarding columns with missing values might result in losing vital disease-associated genes, and eliminating samples with missing values could compromise feature selection and classifier model performance due to reduced sample size.To address these missing values, we employed the K-Nearest Neighbors (KNN) imputation algorithm, with the number of neighbors set to five.We chose KNNimpute imputation because it is a highly popular and extensively used algorithm for handling missing values in gene expression data, known for its ability to preserve the intrinsic structure of the data [62].
Numerous probes were assigned to the same genes after matching probe IDs with corresponding genes from the GEO2R mapping file.To integrate these probes into a single gene, the mean gene expression values were utilized.In order to integrate training datasets and execute uniform validation on validation sets, we maintained common genes across all datasets.Consequently, 10,113 and 16,413 genes were retained for CRC and IBD analyses, respectively.
Since we are merging several datasets from different populations and platforms, we need to scale the samples to ensure uniformity.In order to identify the optimal scaler, a variety of scalers were applied to the data.Principal component analysis (PCA) was utilized to reduce data dimensionality and visualize the outcomes for each scaler.Fig 6 demonstrates that both rowwise MinMax normalization and quantile normalization yield improved dispersion of CRC datasets.Consequently, the model is better equipped to discern the underlying data patterns related to gene contributions.Given the varying ranges of different genome datasets, row-wise MinMax normalization offers an advantage over a quantile transformer when classifying new datasets or samples.Furthermore, it preserves gene expression correlations while transforming gene expression into a range of 0 to 1 for each instance.Hence, row-wise MinMax scaling was executed prior to feature selection across all datasets.Row-wise MinMax scaling was also applied to transform the IBD datasets.
We also note that the number of cases is much more than the number of controls in both our CRC and IBD training datasets.The high case-to-control ratio undermines the performance of both feature selection and the classification model and can build models that are biased toward cases.To tackle this issue, we employed SMOTE [32] (synthetic minority oversampling technique), an effective oversampling strategy, to use the controls in the training datasets and generate synthetic controls; this way, we equalize the number of cases and controls within the training dataset.

Feature selection
We incorporated the parameters suggested for biological data in the paper as parameters for our algorithm (Th irr = 3, Th red = 4, α = 50, β = 5, and k = 100) [30].Executing the SVFS feature selection algorithm may yield varying subsets of features on each iteration.To achieve consistent results and verify the relationship between the identified genes and the disease, we ran the algorithm 1000 times on the training dataset and identified the top 100 genes that were repeated the most throughout the 1000 runs.Oversampling impacts dataset structure and may alter the gene subset selection by the feature selection algorithm.Consequently, oversampling was conducted before each iteration of feature selection to guarantee a more robust selection.The S2 Fig illustrates the top 100 most repeated genes and their number of occurrences in the CRC (and IBD) train dataset.Once we determined the top 100 genes using SVFS on the training dataset, we discarded all other genes from all the datasets, including the training and validation.Then we trained our ensemble classifier (RF+LR+SVM) on the reduced train dataset.Finally, we validated the model on each of the validation datasets and reported several performance metrics.Among all the performance metrics, the confusion matrix is perhaps the most transparent one and all other metrics are derived from the confusion matrix.
Fig 1 provides a schematic representation of the research pipeline and the various tasks executed for both IBD and CRC.

Fig 1 .
Fig 1.Schematic representation of the research workflow.a Raw datasets are retrieved from GEO, and tabular datasets are generated utilizing gene expression data and probe-gene mapping.b Data processing steps are performed, including discarding unassigned genes, imputing missing values, removing non-common genes, combining identical genes, and scaling each dataset.c After splitting datasets into training and validation sets and merging training sets to form a single set, a 1000-iteration oversampling/feature selection process is applied to identify the most prominent genes.d An ensemble classifier, comprising Random Forest, Support Vector Machine, and Logistic Regression, is trained on the training set.e The results are validated on the validation sets using the trained model, and four performance metrics-accuracy, F1-score, precision-recall, and confusion matrixare employed for the evaluation of case-control sets and recall is employed for case-only sets.https://doi.org/10.1371/journal.pone.0290192.g001 Fig 3, illustrates the results for caseonly datasets.

Fig 2 .
Fig 2. Evaluation of identified CRC genes on independent validation sets.Accuracy and F1-score are plotted for the different number of prominent genes utilized for training and validation.Confusion matrices and precision-recall curves (including AUC) are plotted using the first 40 prominent genes.https://doi.org/10.1371/journal.pone.0290192.g002

Fig 3 .Fig 4 .
Fig 3. Evaluation of the model trained on tumor and matched normal samples on case-only datasets.https://doi.org/10.1371/journal.pone.0290192.g003 Fig 5(a) illustrates the resulting network.As observed in Fig 5(a), 13 IBD-associated genes and 27 CRC-associated genes have direct interaction (without intermediary genes).The GAPDH gene appears to play a pivotal role in linking the two gene subsets and is one of the most crucial CRC-associated genes identified in our subset.To investigate potential interactions between CRC and IBD genes, we extended the network in Fig 5(a) to include intermediary genes that may serve as a bridge between CRC and IBD genes.For this extended network in Fig 5(b), we took into account only single intermediary genes, which is a drawback since the bridge could involve two genes, for example.While single intermediary genes are more influential, other genes with minor additive effects are overlooked.The S5 Table lists the full names of single intermediary genes.Owing to the network's complexity, we preserved genes in Fig 5(b) with a higher number of connections in the network for illustrative purposes.For example, TP53 may be regarded as the most critical

Fig 5 .
Fig 5. IBD and CRC gene interaction networks generated by STRING for identified genes.a Network generated based on CRC and IBD genes without the participation of intermediary genes.b Network generated based on CRC and IBD genes with the participation of intermediary genes.https://doi.org/10.1371/journal.pone.0290192.g005

Table .
As demonstrated in Fig 4, the ensemble classifier effectively distinguished inflamed samples from